19 research outputs found

    Utilisation efficace des accélérateurs GPU -- Ordonnancement sur machines hybrides (Efficient use of GPU accelerators -- Scheduling on hybrid machines)

    National audience. The race for ever more computing power raises the issue of supercomputers' power consumption. Heterogeneous architectures, composed of processors and GPU accelerators, seem to be a promising answer. Scheduling on such machines currently relies on the programmer's skills and is done statically. We study the problem of scheduling independent tasks on such architectures. We propose a bi-objective approximation algorithm, of low algorithmic complexity, which simultaneously optimizes the makespan and the affinity with performance guarantees. We then validate the performance of its implementation within the XKaapi parallel computing framework through an experimental study.
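
    As a rough illustration of the trade-off described above (this is not the paper's bi-objective algorithm; the per-task affinity score and the weighting knob alpha are assumptions of the sketch), a simple list scheduler could weigh the earliest completion time on each resource type against the task's affinity:

    import heapq

    def affinity_list_schedule(tasks, n_cpus, n_gpus, alpha=0.5):
        """tasks: list of (p_cpu, p_gpu, gpu_affinity in [0, 1]).
        alpha weighs completion time against affinity (assumed knob)."""
        cpus = [(0.0, i) for i in range(n_cpus)]   # (ready time, resource id)
        gpus = [(0.0, i) for i in range(n_gpus)]
        heapq.heapify(cpus)
        heapq.heapify(gpus)
        placement = []
        for p_cpu, p_gpu, aff in tasks:
            cpu_ready, c = cpus[0]
            gpu_ready, g = gpus[0]
            # Lower score is better: finish early and respect the affinity.
            cpu_score = alpha * (cpu_ready + p_cpu) + (1 - alpha) * aff * p_cpu
            gpu_score = alpha * (gpu_ready + p_gpu) + (1 - alpha) * (1 - aff) * p_gpu
            if gpu_score <= cpu_score:
                heapq.heapreplace(gpus, (gpu_ready + p_gpu, g))
                placement.append(("GPU", g))
            else:
                heapq.heapreplace(cpus, (cpu_ready + p_cpu, c))
                placement.append(("CPU", c))
        return placement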

    Scheduling independent tasks on multi-cores with GPU accelerators

    International audience. More and more computers use hybrid architectures combining multi-core processors and hardware accelerators like GPUs (Graphics Processing Units). We present in this paper a new method for efficiently scheduling parallel applications on m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU. The objective is to minimize the maximum completion time (makespan). The corresponding scheduling problem is NP-hard; we propose an efficient approximation algorithm which achieves an approximation ratio of 4/3 + 1/(3k). We first detail and analyze the method, based on a dual approximation scheme, which uses dynamic programming to balance the load evenly between the heterogeneous resources. Then, we present a faster approximation algorithm for a special case of the previous problem, where all the tasks are accelerated when assigned to a GPU, with a performance guarantee of 3/2 for any number of GPUs. We run simulations based on realistic benchmarks and compare the solutions obtained by a relaxed version of the generic method to those provided by a classical scheduling algorithm (HEFT). Finally, we present an implementation of the 4/3-approximation and its relaxed version, on a classical linear algebra kernel, within the scheduler of the xKaapi runtime system.
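
    The skeleton of a dual approximation scheme can be sketched as follows. It is only an illustration: the paper's feasibility test is a dynamic program that yields the 4/3 + 1/(3k) guarantee, whereas the greedy stand-in used here carries no such guarantee.

    def feasible(tasks, m, k, lam, ratio=4/3):
        """Try to build a schedule of length at most ratio * lam.
        tasks: list of (p_cpu, p_gpu) processing times; greedy stand-in
        for the dynamic program used in the paper."""
        cpu_load = [0.0] * m
        gpu_load = [0.0] * k
        # Consider the tasks with the largest GPU benefit first.
        for p_cpu, p_gpu in sorted(tasks, key=lambda t: t[1] - t[0]):
            c = min(range(m), key=lambda i: cpu_load[i])
            g = min(range(k), key=lambda i: gpu_load[i])
            if p_gpu <= p_cpu and gpu_load[g] + p_gpu <= ratio * lam:
                gpu_load[g] += p_gpu
            elif cpu_load[c] + p_cpu <= ratio * lam:
                cpu_load[c] += p_cpu
            elif gpu_load[g] + p_gpu <= ratio * lam:
                gpu_load[g] += p_gpu
            else:
                return False          # the guess lam is rejected
        return True

    def dual_approximation(tasks, m, k, eps=1e-3):
        """Binary search on the makespan guess lam (dual approximation)."""
        lo = max(min(t) for t in tasks)
        hi = sum(min(t) for t in tasks)
        while hi - lo > eps * max(lo, 1e-9):
            lam = (lo + hi) / 2
            if feasible(tasks, m, k, lam):
                hi = lam
            else:
                lo = lam
        return hi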

    Playing with power at runtime: Slightly slowed applications, major energy savings

    National audience. The soberness, in terms of electrical power, of data centers and high-performance computing (HPC) systems is becoming an important design issue, as the global energy consumption of Information Technologies (IT) is rising to considerable levels. This question is all the more complex as these systems are increasingly heterogeneous and variable in their behavior with respect to their performance and power consumption. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Additionally, applications tend to present phases (I/O, compute- or memory-intensive, checkpointing) which vary over time, and to be executed in an environment subject to external constraints (e.g., concurrency or an energy envelope).

    This increasing complexity makes HPC less predictable offline (prior to execution). Dealing with time variations and unpredictable disturbances therefore demands runtime management. In this work, we realize dynamic adaptation using feedback control, falling within the scope of autonomic computing and relying on control theory. In particular, we address the problem of controlling the power allocated to processors, and hence their energy consumption and performance. Feedback control allows energy consumption to be reduced by decreasing the speed with a limited and configurable performance loss, by exploiting periods where read/write operations slow down progress. The proposed controller has an easily configurable behavior: the user only has to supply an acceptable degradation level. The behavior of an HPC application on such a system undergoes many variations, depending on (i) the cluster, (ii) the node, (iii) the run, and even (iv) the moment within a single run.

    We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid'5000, using a standard memory-bound HPC benchmark. Our results show the existence of a family of trade-offs to save energy, depending on the allowed degradation (from 0 to 20%). In particular, our control approach saves, on average, 22% of the energy at the cost of a 7% increase in execution time, and reaches up to 25% energy savings with the adaptation. Our solution has proven robust to variations across machines (from one node to another) and across runs (from one execution of the application to another).

    The experiments conducted in this work require instrumenting low-level software stacks. Conducting this work on top of Grid'5000 was key, as it allowed us to study various hardware setups (varying numbers of sockets, varying amounts of memory) and their impact on the controller. The presence of clusters composed of homogeneous hardware allowed us to study the robustness of the devised control with respect to the variability in hardware performance despite identical specifications. Finally, our work relied on the power measures provided by the integrated sensors; we could extend this work by exploiting the available power sensors.

    Our future work will tackle three remaining challenges: (i) handling various types of phases and their chaining in an application, (ii) distributed execution (a different power cap enforced on each processor or core), and (iii) non-instrumented applications (for which instrumentation is not possible).
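
    A minimal sketch of such a runtime feedback loop is given below. It is not the Argo Node Resource Manager API: the progress sensor and the power-cap actuator are passed in as hypothetical callables, and the proportional-integral gains are purely illustrative.

    import time

    def power_control_loop(measure_progress, apply_powercap,
                           max_degradation=0.07, period=1.0,
                           p_min=50.0, p_max=200.0, kp=5.0, ki=1.0):
        """Keep application progress within (1 - max_degradation) of the
        progress observed at the maximum power cap, while lowering the cap.
        measure_progress(): progress per second (e.g. a heartbeat rate).
        apply_powercap(watts): actuator, e.g. a processor power limit."""
        apply_powercap(p_max)
        time.sleep(period)
        baseline = measure_progress()
        setpoint = (1.0 - max_degradation) * baseline
        integral = 0.0
        while True:
            time.sleep(period)
            error = setpoint - measure_progress()   # > 0: we slowed down too much
            integral += error * period
            # Position-form PI controller around the maximum cap.
            cap = max(p_min, min(p_max, p_max + kp * error + ki * integral))
            apply_powercap(cap)

    In practice the gains would have to be tuned per platform, which is precisely where the robustness study summarized above comes into play.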

    A Methodology for Handling Data Movements by Anticipation: Position Paper

    The enhanced capabilities of large-scale parallel and distributed platforms produce a continuously increasing amount of data which has to be stored, exchanged, and used by various tasks allocated on different nodes of the system. The management of such a huge communication demand is crucial for reaching the best possible performance of the system. Meanwhile, we have to deal with more interference, as the trend is to use a single all-purpose interconnection network, whatever the interconnect (tree-based hierarchies or topology-based heterarchies). There are two different types of communications, namely the flows induced by data exchanges during the computations and the flows related to Input/Output operations. We propose in this paper a general model for interference-aware scheduling, where explicit communications are replaced by external topological constraints. Specifically, the interference of both communication types is reduced by adding geometric constraints on the allocation of tasks onto machines. The proposed constraints implicitly reduce data movements by restricting the set of possible allocations for each task. This methodology has proved efficient in a recent study for a restricted interconnection network (a line/ring of processors, which is intermediate between a tree and higher-dimensional grids/tori). The results obtained illustrate well the difficulty of the problem even on simple topologies, but also provide a pragmatic greedy solution, whose efficiency was assessed by simulations. We are currently extending this solution to more complex topologies. This work is a position paper describing the methodology; it does not focus on the solving part.
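
    The flavor of these geometric constraints can be illustrated on the line topology mentioned above with a deliberately naive rule (first fit with contiguity); the greedy solution assessed in the cited study is more involved than this sketch.

    def allocate_contiguous(job_sizes, n_nodes):
        """First-fit allocation of jobs onto a line of n_nodes nodes,
        each job receiving a contiguous interval of nodes (or None if
        no contiguous hole is large enough). Contiguity is the geometric
        constraint limiting interference between the jobs' traffic."""
        free = [True] * n_nodes
        intervals = []
        for size in job_sizes:
            start, run = None, 0
            for i, is_free in enumerate(free):
                run = run + 1 if is_free else 0
                if run == size:
                    start = i - size + 1
                    break
            if start is None:
                intervals.append(None)                   # job cannot be placed
            else:
                for i in range(start, start + size):
                    free[i] = False
                intervals.append((start, start + size))  # half-open interval
        return intervals

    # Example on 10 nodes: jobs of sizes 4 and 3 fit, the job of size 5 does not.
    print(allocate_contiguous([4, 3, 5], 10))            # [(0, 4), (4, 7), None]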

    Scheduling Independent Moldable Tasks on Multi-Cores with GPUs

    The number of parallel systems using accelerators is growing. The technology is now mature enough to allow sustained petaflop/s performance. However, reaching this performance scale requires efficient scheduling algorithms to manage the heterogeneous computing resources. We present a new approach for scheduling independent tasks on multiple CPUs and multiple GPUs. The tasks are assumed to be parallelizable on CPUs using the moldable model: the final number of cores allotted to a task can be decided and set by the scheduler. More precisely, we design an algorithm aiming at minimizing the makespan (the maximum completion time of all tasks) for this scheduling problem. The proposed algorithm combines a dual approximation scheme with a fast integer linear program (ILP). It determines both the partitioning of the tasks, i.e., whether a task should be mapped to CPUs or to a GPU, and the number of CPUs allotted to a moldable task if it is mapped to the CPUs. A worst-case analysis shows that the algorithm has an approximation ratio of 3/2 + ε. However, since the complexity of the ILP-based algorithm could be non-polynomial, we also present a provably polynomial-time algorithm with an approximation ratio of 2 + ε. We complement the theoretical analysis of our two novel algorithms with an experimental study. In these experiments, we compare our algorithms to a modified version of the classical HEFT algorithm, adapted to handle moldable tasks. The experimental results show that our algorithm with the 3/2 + ε approximation ratio produces significantly shorter schedules than the modified HEFT for most of the instances. In addition, the experiments provide evidence that this ILP-based algorithm is also practically able to solve larger problem instances in a reasonable amount of time.
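
    The two-level decision can be made concrete with the simplified sketch below. The speedup model work / q^alpha, the value of alpha, and the greedy splitting rule are assumptions of the sketch; the paper instead solves the CPU/GPU partitioning with an ILP under the dual-approximation guess lam.

    def canonical_cores(work, max_cores, lam, alpha=0.8):
        """Smallest core count q whose (assumed) runtime work / q**alpha
        fits under the makespan guess lam, or None if none does."""
        for q in range(1, max_cores + 1):
            if work / q ** alpha <= lam:
                return q
        return None

    def partition(tasks, m, k, lam, alpha=0.8):
        """tasks: list of (cpu_work, p_gpu). Greedy stand-in for the ILP:
        returns (cpu_allotments, gpu_tasks) or None if lam is rejected."""
        cpu_area, gpu_load = 0.0, [0.0] * k
        cpu_allotments, gpu_tasks = [], []
        for idx, (work, p_gpu) in enumerate(tasks):
            q = canonical_cores(work, m, lam, alpha)
            g = min(range(k), key=lambda j: gpu_load[j])
            cpu_time = None if q is None else work / q ** alpha
            # Prefer the GPU when the task fits under lam and is cheaper there.
            if p_gpu <= lam and gpu_load[g] + p_gpu <= lam and (cpu_time is None or p_gpu <= cpu_time):
                gpu_load[g] += p_gpu
                gpu_tasks.append(idx)
            elif q is not None and cpu_area + q * cpu_time <= m * lam:
                cpu_area += q * cpu_time
                cpu_allotments.append((idx, q))
            else:
                return None           # reject the guess lam
        return cpu_allotments, gpu_tasks

    Wrapping this test in a binary search on lam, as in the dual-approximation sketch given earlier, would turn it into a complete (though unguaranteed) scheduler.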

    Composition of Scheduling and Control Theory Techniques

    No full text
    International audience. The management and allocation of resources to users in HPC infrastructures often relies on the RJMS (Resources and Jobs Management System). One key component for an optimized resource allocation, with respect to some objectives, is the scheduler. Scheduling theory is interesting as it provides algorithms with performance guarantees. These guarantees come at the cost of a tedious and complex modeling effort. The growing complexity of current and future platforms (hardware heterogeneity, memory/bandwidth/energy constraints) pushes the scheduling approach to its limits. Taking these new challenges into account either requires the design of new, overly complex models, or exposes the crudeness of the existing model. In a sense, the scheduling approach fails to capture the dynamic aspects of the platforms. From the control-theory point of view, scheduling algorithms are open-loop systems: the actual state of the platform is not fed back into the decision process. By closing the loop and using control-theory results and techniques, we propose to study how to combine both approaches. This study would take place at various levels, from theory to actual applications.
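
    A minimal sketch of the closed-loop idea follows; the three callables are hypothetical hooks into an RJMS, not an existing API.

    import time

    def closed_loop_scheduling(observe_state, compute_schedule, enact,
                               period=30.0, rounds=100):
        """Open loop would call compute_schedule once, on a static model.
        Here the observed platform state is fed back at every period."""
        for _ in range(rounds):
            state = observe_state()          # actual loads, queue, power, failures...
            plan = compute_schedule(state)   # any guaranteed scheduling algorithm fits here
            enact(plan)                      # apply only the imminent decisions
            time.sleep(period)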

    moldableILP

    No full text
    This repository contains the source code used to produce the simulation results of the following publication (currently in press): R. Bleuse, S. Hunold, S. Kedad-Sidhoum, F. Monna, G. Mounie, and D. Trystram, "Scheduling Independent Moldable Tasks on Multi-Cores with GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-1, doi: 10.1109/TPDS.2017.267589.
